Here we compute the transcription factor and pathway activity of the 5-FU treated human in-vitro data. This will give us an overview of the relevant pathways and TFs activated and responding to the drug, depending on time and concentration.
There is a problem in DE analysis: some genes with no expression gets very high log2FC. Check the human_SI_transcriptomics_issue script to see details.
We found that if we filter out the genes that have zero expression across more than one replica the issue is resolves.
First we import all the transcriptomics data of SI treated with 5-FU and Colon treated with 5-FU. This data was already preprocessed, i.e. we removed the genes where the log2 FC was artificially high, while the expression was zero.
transcriptomics_data <- import_human_transcriptomics_preprocessed_data()
sample_table <- transcriptomics_data %>% select(concentration,sample_id,organ,time) %>% unique()
We use the Dorothea regulon, which contains the HGNC symbols of transcription factors and transcription factor targets. Therefore we need to map the Ensembl gene ids to HGNC. We use BiomaRt for this.
R length(tr_genes) ensembl genes, we found R tr_genes_foundR tr_genes_notfoundList of missing, but significantly changing genes:
# We are missing the following ENSEMBL ids:
missing_genes = tr_genes[!tr_genes %in% genes_dic$ensembl_gene_id]
# around 10-15 genes are significantly different between conditions:
transcriptomics_data %>%
filter(ensembl_gene_id %in% missing_genes) %>%
filter(padj<0.05) %>%
arrange(padj)
## # A tibble: 870 x 12
## fileID concentration time ensembl_gene_id baseMean log2FoldChange lfcSE
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1000u… 1000 72 ENSG00000284526 406. -3.85 0.190
## 2 1000u… 1000 72 ENSG00000272405 304. -3.28 0.166
## 3 1000u… 1000 48 ENSG00000284526 406. -3.39 0.183
## 4 1000u… 1000 72 ENSG00000238164 206. -4.79 0.264
## 5 1000u… 1000 48 ENSG00000238164 206. -4.21 0.234
## 6 1000u… 1000 72 ENSG00000225138 186. -4.55 0.256
## 7 1000u… 1000 72 ENSG00000249673 185. -3.58 0.209
## 8 1000u… 1000 48 ENSG00000272405 304. -2.45 0.147
## 9 1000u… 1000 48 ENSG00000249673 185. -3.22 0.197
## 10 1000u… 1000 48 ENSG00000225138 186. -3.33 0.204
## # … with 860 more rows, and 5 more variables: stat <dbl>, pvalue <dbl>,
## # padj <dbl>, organ <chr>, sample_id <chr>
R n_missing_HGCN genes found by EnsemblID has no HGCN name.
add the gene names to the transcription data
transcriptomics_data <- transcriptomics_data %>%
left_join(genes_dic, by = "ensembl_gene_id")
Some Ensembl id were not found, we removed those genes and also those that have no gene names. Averaged those transcripts that have the same gene name.
Check how many dorothea regulons are in the data
regulons %>%
filter(target %in% dorothea_data$hgnc_symbol) %>%
nrow() %>%
print()
## [1] 4875
nrow(regulons)
## [1] 6620
4873 of the 6620 TF target transcripts are found in the data.
Run VIPER to estimate TF activity in samples. We run the analysis on log2 FC, therefore we get TFs that are strongly up/down regulated in treatment vs control.
# data frame with the original data
viper_input = dorothea_data_hgnc %>%
spread(sample_id,log2FC) %>%
column_to_rownames("hgnc_symbol")
rerun = FALSE
if(rerun){
tf_activities <- dorothea::run_gviper(viper_input, regulons,
options = list(minsize = 5, eset.filter = FALSE,
cores = 1, verbose = TRUE, nes = TRUE),tidy = TRUE)
# attach the description of the samples
tf_activities <- left_join(tf_activities,
rename(sample_table,sample = sample_id),
by = "sample")
if(FALSE) write_rds(tf_activities,"./data/results/tf_activity_gviper_all_human.rds")
}else{
tf_activities = read_rds("./data/results/tf_activity_gviper_all_human.rds")
}
Example:
net_e2f2 <- regulons %>% filter(tf == "E2F2") %>%
dplyr::rename(from = "tf", to = "target")
net_e2f2_nodes <- tibble(id = unique(c(net_e2f2$from,net_e2f2$to))) %>%
mutate(label = id) %>%
#mutate(color = ifelse(id %in% net_e2f2$from, "red","green")) %>%
mutate(hgnc_symbol = id) %>%
left_join(filter(dorothea_data_hgnc,sample_id =="colon_1000uM_24h"), by= "hgnc_symbol") %>%
mutate(size = 5 + 5*(abs(log2FC) - min(abs(log2FC),na.rm = TRUE))) %>%
mutate(size = ifelse(is.na(size),5,size)) %>%
mutate(color = ifelse(is.na(log2FC), "grey", ifelse(sign(log2FC) == 1,"blue","red") ))
visNetwork::visNetwork(nodes = net_e2f2_nodes, edges = net_e2f2)%>%
visNetwork::visEdges(arrows = 'to', scaling = list(min = 2, max = 2),color = "mor" )
We take the 5 top TF’s from each conditions and check which are the genes that were used to determine their activities:
top_5_tf = tf_activities %>%
group_by(sample) %>%
top_n(5, wt = abs(activity)) %>%
arrange(activity) %>% pull(tf) %>% unique()
regulons %>%
filter(tf %in% top_5_tf) %>%
inner_join(dorothea_data %>% rename(target = hgnc_symbol),by = "target") %>%
left_join(sample_table, by = c("concentration", "time", "organ", "sample_id")) %>%
group_by(sample_id,tf) %>%
mutate(lfc_rank = n()-rank(abs(log2FoldChange))) %>%
mutate(label = ifelse(padj<0.05 & lfc_rank<5, target, "" ) ) %>%
arrange(desc(abs(log2FoldChange))) %>%
ggplot(aes(log2FoldChange,-log10(pvalue))) +
geom_point(aes(color=padj<0.05)) +
ggrepel::geom_text_repel(aes(label=label)) +
facet_grid(organ+concentration+time~tf)